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RANDOM-SET METHODS IDENTIFY DISTINCT ASPECTS OF 
THE ENRICHMENT SIGNAL IN GENE-SET ANALYSIS 
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A prespecified set of genes may be enriched, to varying degrees, 
for genes that have altered expression levels relative to two or more 
states of a cell. Knowing the enrichment of gene sets defined by 
functional categories, such as gene ontology (GO) annotations, is 
valuable for analyzing the biological signals in microarray expres- 
sion data. A common approach to measuring enrichment is by cross- 
classifying genes according to membership in a functional category 
and membership on a selected list of significantly altered genes. A 
small Fisher's exact test p-value, for example, in this 2x2 table 
is indicative of enrichment. Other category analysis methods retain 
the quantitative gene-level scores and measure significance by refer- 
ring a category-level statistic to a permutation distribution associated 
with the original differential expression problem. We describe a class 
of random-set scoring methods that measure distinct components of 
the enrichment signal. The class includes Fisher's test based on se- 
lected genes and also tests that average gene-level evidence across the 
category. Averaging and selection methods are compared empirically 
using Affymetrix data on expression in nasopharyngeal cancer tissue, 
and theoretically using a location model of differential expression. 
We find that each method has a domain of superiority in the state 
space of enrichment problems, and that both methods have bene- 
fits in practice. Our analysis also addresses two problems related to 
multiple-category inference, namely, that equally enriched categories 
are not detected with equal probability if they are of different sizes, 
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and also that there is dependence among category statistics owing 
to shared genes. Random-set enrichment calculations do not require 
Monte Carlo for implementation. They are made available in the R 
package allez. 

1. Introduction. In processing results of a microarray study, one is faced 
with the daunting task of relating differential-expression evidence to other 
information about the genes. Any interesting connections that can be re- 
vealed are critical in developing a fuller understanding of the biology and in 
providing guidance toward the next experiment [e.g., Rhodes and Chinnaiyan 
(2005)]. Much of the exogenous information is organized in networks of func- 
tional categories; genes are annotated to the same category by virtue of a 
shared biological property. The Gene Ontology (GO) project is perhaps the 
best example of how biological information is carried by networked collec- 
tions of functional categories [Gene Ontology Consortium (2000, 2004)]. Ini- 
tiated as a collaboration among different genome projects, GO has become 
a fundamental resource that records attributes of genes and gene products 
and that organizes these attributes using networks of connected functional 
categories. 

The problem of enrichment emerges in relating gene-level expression re- 
sults with functional categories. To what extent, if at all, are genes with 
altered expression over-represented in a named category? At the risk of over- 
simplifying things, the extensive research and development toward solving 
this problem may be classified by two statistical approaches. The first begins 
by selecting a short list of genes that are altered significantly relative to the 
cell grouping under study: for instance, genes with extreme fold change or 
with extreme value of a test statistic. The intersection of the selected list 
and the functional category is then evaluated, perhaps by Fisher's exact test 
or a variant, which scores the category highly for enrichment if many more 
selected genes than expected belong to the category [Draghici et al. (2003), 
Berriz et al. (2003), Doniger et al. (2003), Al-Shahrour, Uriarte and Dopazo 
(2004), Beifibarth and Speed (2004), Cheng et al. (2004), Zhong et al. (2004), 
Dodd et al. (2006)]. Available informatics tools and related problems are re- 
viewed in Khatri and Draghici (2005). A second approach is developed in 
Virtaneva et al. (2001) and Barry, Nobel and Wright (2005), called SAFE 
(Significance analysis of function and expression) and also in Mootha et al. 
(2003) and Subramanian et al. (2005), called GSEA (Gene set enrichment 
analysis). Briefly, expression information on all the genes under study is re- 
tained; then a permutation analysis is used to measure the significance of 
category-level statistics computed from these gene-level statistics. 

Existing tools have been effective in adding value to expression results, 
but they remain limited for evaluating enrichment signals. Analysis is sim- 
plified when considering selected gene lists, since quantitative scores from 
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the gene- level analysis are not required. But then the enrichment results 
depend on the stringency of the selection, and give equal weight to genes at 
both ends of the selected list. This problem is redressed in the SAFE/GSEA 
approach. The permutation method adopted by SAFE/GSEA refers back 
to the labeled microarray data themselves rather than to the results of the 
differential-expression analysis. There is an added computational burden in 
this strategy and also it can become ineffective when few microarrays enter 
the permutation. A technical issue, further, concerns the null hypothesis at 
work in the SAFE/GSEA permutation. It refers to the complete absence of 
differential expression rather than to the absence of enrichment. 

In this paper we explore properties of random-set methods for measur- 
ing enrichment. We adopt category-level statistics like in SAFE/GSEA, but 
we calibrate them in the same way that Fisher's exact test calibrates the 
intersection of a functional category and a selected list. That is, we cali- 
brate them conditionally on results of the differential expression analysis by 
considering values of the category-level statistic that would be achieved by 
a random set of genes (Section 2). Calculations are simplified by formulae 
for the expected value and variance of this conditional distribution, so that 
Monte Carlo approximations may not be required. Random-set scoring is 
applicable to a variety of gene- level scores; we compare two schemes em- 
pirically in a study of nasopharyngeal cancer in Section 3. One measures 
enrichment by counting the intersection with a selected gene list; the other 
considers average differential-expression evidence across all genes in the cate- 
gory. In conjunction with empirical evidence we pursue a theoretical analysis 
to compare these two category scoring methods (Section 4). We find that 
two parameters affect the power to detect enrichment, and these play out so 
that neither the selection approach nor the averaging approach is uniformly 
superior. Additionally, we show how the random-set approach facilitates si- 
multaneous inference among multiple categories. Two important issues are 
(1) how to accommodate the power imbalance caused by differently sized- 
categories, and (2) how to obtain the joint distribution of category scores in 
order to have valid type-I error rate control (Section 5). We offer approxi- 
mate analytical solutions to these problems. 

2. Random-set enrichment scoring. We describe a general method to 
score categories for enrichment with expression-altered genes. Sengupta et al. 
(2006) (especially Supplementary Data) introduced the method and de- 
scribed it briefly. It forms the basis of our approach and so here we amplify 
and clarify the presentation. The class extends Fisher's exact test by allow- 
ing a variety of gene- level scores, denoted {s g }, for different genes g. These 
may be binary indicators of extreme differential expression, but we allow 
more general quantitative expression scores. We focus initially on a single 
category C containing m genes. 
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The idea is to consider the unstandardized enrichment score X = — J2gec s 9 
as a random variable wherein the randomness comes not through the gene 
scores {s g } but rather through the set C. We are concerned, after all, with 
measuring enrichment for a specific category C compared to other hypothet- 
ical categories from the same system. It is useful to treat the random set C 
as drawn uniformly at random from the ( ) subsets of m distinct genes from 
the population of G genes, just a simple random sample drawn without re- 
placement [e.g., Cochran (1977)]. This is equivalent to a permutation scheme 
in which gene-level scores are randomly shuffled among the gene labels. Pre- 
cisely this scheme underlies Fisher's exact test in the special case that s g 
is the binary indicator of selection onto the significantly altered gene list 
(Supplementary Table 1). The random-set model is applicable beyond the 
binary case to any sort of gene-level scores, though the induced distribution 
of X becomes intractable. Rather than resort to Monte Carlo, we find that 
the first two moments of the otherwise intractable distribution are available 
analytically, and that the induced distribution is approximately Gaussian. 
These findings are the basis of our proposed standardization. Under the 
random-set model, and thus conditional on gene-level scores {s g }, 

(1) » = E(X) = E °= lS9 



G 



and 

(2) a 2 = var(X) 



m\G-l J IV G J V G 



which are easily computed from the full set of gene-level scores and the cate- 
gory size. Notably, the mean /j, does not depend on attributes of the category, 
though the variance depends on the category size m. We propose the stan- 
dardized category-enrichment score Z = (X — fi)/cr, which is a mean zero, 
unit variance score on the null hypothesis that category C is not enriched for 
differentially expressed genes. Analysis is simplified, especially in the case 
of multiple categories, because Z is computable without using permutation. 
Large values of Z favor the enrichment hypothesis. For moderate to large 
categories, central limit theory indicates that Z is distributed approximately 
as a standard normal on the no-enrichment null hypothesis. 

Enrichment scoring is enlivened by the possibility of using a variety of 
gene-level scores. We may use log fold changes, t-statistics or other local 
measures of differential expression. In the special case where {s g } are the 
ranks associated with gene-level scores, we get a version of the Wilcoxon 
test for enrichment, since ml is a sum of ranks, and both fj, and a 2 simplify 
as 

G + l 2 (G-m)(G + l) 
^ = ^~' a= 12^ ' 
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with suitable adjustments for ties. Gene-level scores from an empirical Bayesian 
analysis might be posterior probabilities of differential expression [Kendziorski et al. 
(2003)], in which case ml equals the posterior expected number of altered 
genes in the category, and Z calibrates this relative to the population of 
genes. Section 3 develops an example in which {s g } are transformed Spear- 
man correlations between host genes and the expression of a particular viral 
gene. Efficiency and approximate normality of the Z score will be improved 
if the distribution of gene-level scores is suitably regular. For instance, it is 
preferred to use log transformed p-values instead of p-values, and log fold 
instead of raw fold change. 

Another important special case happens when {s g } are binary scores indi- 
cating selection to a short list of significantly differentially expressed genes. 
Then Z 2 = (— ^)U, where U is Pearson's chi-squared statistic for testing 
independence between category and short-list assignment [calculations not 
shown, but following, e.g., Bickel and Doksum (2001), page 402]. To a minor 
approximation, then, our proposed Z score corresponds to Fisher's or Pear- 
son's test when {s g } are binary gene-level scores. Otherwise it generalizes 
those category scores and measures other aspects of the enrichment signal, 
as we demonstrate next. 

3. An analysis of host/virus associations in cancer. A recent expression 
study of nasopharyngeal carcinoma (NPC) used the proposed methodol- 
ogy for category enrichment [Sengupta et al. (2006)]. NPC is a cancer of 
the nasopharynx that is responsible for 60-70,000 deaths per year world- 
wide. Nearly all cases are associated with the Epstein-Barr virus (EBV) 
infection, though the molecular determinants and the nature of the host- 
virus interactions remain poorly understood. Sengupta et al. (2006) studied 
tumor tissue from n = 31 NPC patients using Affymetrix hgul33plus2 mi- 
croarrays to measure host gene expression and using RT-PCR to measure 
the expression of 10 viral genes. The hgul33plus2 microarrays probe the 
transcriptome with G =54,675 probe sets. Here we reconsider associations 
between host expression and the expression of the single viral gene EBNAl. 

The statistical analysis of host-virus association rests on pairwise Spear- 
man correlations between individual Affymetrix probe set values and the 
expression of EBNAl. Supplementary Figure 1 shows one probe set and 
its correlation with EBNAl. Supplementary Figure 2 shows correlations 
with EBNAl for all host probe sets. The most extreme negative correla- 
tion is r = —0.75, which is unusually small (Supplementary Figure 3, p- 
value = 0.04). A striking feature of the empirical distribution of correlations 
is that 65% of host probe sets are negatively correlated with EBNAl. This is 
significantly more than expected if truly there is no association between host 
and virus expression (p- value = 6 x 10~ 4 , Supplementary Figure 3). Glob- 
ally, there is evidence for significant negative association between EBNAl 
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Fig. 1. Gene selection. For the host-virus association example, plotted is a histogram 
of the 54, 675 Spearman correlations between expression of each Affymetrix probe set and 
the expression of the viral gene EBNA1. Correlations are computed using microarray data 
on 31 tumor samples as described in Sengupta et al. (2006). Highlighted on the left are 
selected probe sets that have significant negative correlation with EBNA1 according to a 
q-value analysis that targets a 5% FDR gene list (correlation less than — 0.55,). 



expression and the expression of host genes in NPC. Figure 1 highlights a 
selected list of the 574 most significantly altered host probe sets; the list 
targets a 5% false discovery rate (FDR) according to the q-value method of 
Storey (2003). In this calculation p- values were obtained by recalling that 

(3) S9 = \^~^ l0g YTVg 

is approximately standard normal in the absence of a true correlation be- 
tween g and EBNAl [Fisher (1921)]. The sign change employed (compared 
to the usual inverse hyperbolic tangent transform) means that genes which 
correlate negatively with EBNAl have a positive gene score s g . Naturally, 
we may examine the genes on this selected list, but a study of functional 
categories that are enriched for negatively associated genes exposes more of 
the relevant biology. 
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Fig. 2. Random-set scoring. Shown are results of two category-scoring methods applied 
to 2161 GO categories for which m > 10 and based on the gene-level correlations from Fig- 
ure 1. Both methods aim to detect enrichment of the category for genes that are negatively 
associated with EBNA1 viral expression. Category GO:0019883, " antigen presentation, en- 
dogenous antigen" contains m = 48 probe sets and scores highly by both methods (large red 
dot). The large (771 = 1494,) "immune response" category (GO:0006955) scores extremely 
highly by Z avc but not so highly by Z sc i (green dot). Shown in red are categories that are 
subsets of "immune response." The average correlation of immune response genes with 
EBNAl is extremely significant, but the number of significantly negatively correlated genes 
is less significant. 



Figure 2 summarizes two category-enrichment scoring methods applied to 
all GO categories (2761) containing at least m = 10 annotated hgul33plus2 
probe sets. (This used the October 2005 build of Bioconductor package 
hgul33plus2.) Many probe sets were unannotated, and to avoid potential 



M. A. NEWTON ET AL. 



biases, we restricted attention to the universe of G = 27,152 annotated probe 
sets [Al-Shahrour, Uriarte and Dopazo (2004)]. The two enrichment scoring 
methods are conditional Z scores as described in Section 2. The first, Z ave , is 
based on gene-level transformed correlations s g from (3). The second, Z se \, 
is based on binary scores l[s g > k], where k defines the 5% FDR list of the 
most significantly negatively correlated host genes. Recall that Z sc \ is the 
normal-score version of Fisher's exact test. Since each Z score is nominally 
standard normal in the absence of enrichment, Figure 2 seems to indicate 
that many GO categories are enriched for altered genes. Reference lines at 
z = 5 are drawn for guidance (nominal p-value < 10~ 6 ). A noteworthy fea- 
ture in Figure 2 is that Z se \ and Z avc are not perfectly correlated. They 
capture different aspects of the enrichment signal, and thus, they deserve 
separate consideration. Some categories have high Z se \ but negligible Z avc . 
They are enriched for genes on the short list of most significantly negatively 
correlated genes, but the average correlation is not unusual. Other categories 
have high Z ave but negligible Z se \. These would not be detected by Fisher's 
test, for example, though on the average the negative correlation exhibited 
by the contained genes is extremely unusual. 

That Z se \ and Z avc capture different aspects of the enrichment signal is 
exemplified by the immune response category, GO:0006955, which connects 
m = 1494 probe sets on the hgul33plus2 microarray. Recall that a signif- 
icant mass of host probe sets are negatively correlated with the EBNA1, 
though most of these do not occupy the 5% FDR selected list of most sig- 
nificantly altered host probe sets. Among the 2761 GO categories are many 
(marked in red) that are subsets of GO:006955; that is, they represent spe- 
cific forms of the immune response. Notably, all these subsets have Z avc > 0, 
which indicates that their average correlation with EBNA1 is more nega- 
tive than average. Taken together, we get strong evidence of enrichment by 
Z ave . At the same time, many of the subsets have Z sc \ < 0, which indicates 
that they have less representation on the selected list than we expect, and 
thus, selected probe sets are not particularly over-represented in the immune 
response category. 

Sengupta et al. (2006) followed up on some of the categories that showed 
both extreme Z se \ and extreme Z avc , such as GO:0019883, which is in the 
biological process network, with GO term antigen presentation, endogenous 
antigen. This category is marked up in subsequent figures. There are m = 48 
probe sets annotated to this category, and x = 8 occupy the selected list, 
giving z sc \ = 10.6. Also, the average correlation with EBNA1 is unusually 
low, with z avc = 7.68. An informal look at the short list of 574 significantly 
altered probe sets probably would not have revealed a preponderance of 
GO:0019883 genes. Indeed, the best ranking is at the 90th position, there are 
only three probe sets in the top 250. Followup experiments on the genes in 
GO:0019883 confirmed the negative correlation findings that were suggested 
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by the enrichment analysis. Ongoing research aims to understand whether 
viral EBNA1 is taking advantage of host cells that have disabled antigen 
presentation function, or whether the virus is effecting a change in the host 
expression itself. 

It is routine that named genes are associated with multiple probe sets 
on an Affymetrix microarray (Supplementary Figure 4A). GO:0019883, for 
example, represents only m g = 12 genes though it has m p = 48 probe sets. 
The fact is important for enrichment calculations since we ought to avoid 
spurious findings that reflect over-representation of certain genes in the sys- 
tem rather than biologically significant enrichment. Various solutions are 
available. Ideally we would first reduce the probe set data to the gene level, 
and then proceed with enrichment calculations on this reduced space (giv- 
ing Zidcal)- A computationally much simpler adjustment is suggested by the 
variance formula (2). It uses the probe set based Z score, and the numbers 
m g and m p to compute 



The rationale is that X and \i may not change much in the reduction 
step; most of the effect will be on the variance. The naive, ideal (reduce 
by median) and adjusted enrichment scores are compared in Supplemen- 
tary Figure 4B, C. The ideal scores tend to be more conservative than 
the naive ones; GO:0019883 remains impressive with Zideai ; ave = 4.78 and 
£ideal;sel = 11.3 (latter not shown). Globally, the adjustment (4) is similar 
to the ideal score and it tends to be conservative. In the example category 
GO:0019883, ^ a djust ; ave = 3.84 and z a djust;sei = 5.3. Thus, effective approxi- 
mations accommodate the multiple probe sets per gene problem. In some 
cases it may be possible to select the most reliable probe sets from among 
the multiple probe sets associated with a gene. For example, a probe set 
showing high average intensity and high dynamic range across multiple tis- 
sue types may be better than one measuring near the background signal in 
most samples. We do not address the selection of probe sets in this paper, 
but if probe sets are selected for some genes, equation (4) could be applied 
to the selected probe sets. 

For further comparison, we applied the SAFE procedure [Barry, Nobel and Wright 
(2005)] to all 2761 GO categories using the original microarray data and 
EBNA1 expression data on all 31 tumor samples. We adopted the same 
category-level statistic in order to control the comparison. Specifically, gene- 
level Spearman correlations were transformed to ranks to be used as s g val- 
ues, and the category-level statistic X was the average rank (ranks were rela- 
tive to the 27,152 probe sets having some annotation). Results for GO:0019883 
are summarized in Figure 3. Visually, the category appears to have a pre- 
ponderance of negative correlations with EBNA1 (Panel A), and this is 





10 



M. A. NEWTON ET AL. 



.1 








































» MM 1GMQ I S4BD SOQSfi ^XO 



S 

[3- 



EOCC t HH0 I6WW HBW 




Com leccn HOB Xjtyn Ssciid 1:01 i>noa i4icc n»3a ihaz< 

lii«t:-', imu rtc ctctqory xuibtK 



Fig. 3. Comparison with SAFE/GSEA: Panel A is a rank plot of probe set correlation 
scores for the host-virus example. The positions of the m = 48 probe sets from GO:0019883 
are marked, and suggest a preponderance of correlations on the negative side. The average 
rank is noted with an arrow (5478). Panel B compares the empirical distribution functions 
(e.d.f.s) of two different null distributions for the average rank statistic (red — random sets; 
blue — SAFE). Associated histograms are in panels C and D, but the e.d.f.s reveal the scale 
difference more dramatically; B = 10,000 random draws are used in each case. Recall that 
the random set (red) distribution is obtained by shuffling the GO:0019883 ranks in panel 
A. By contrast, the SAFE (blue) distribution returns to the original data and shuffles chip 
labels (as in Figure 3, Supplementary material). The p-value from SAFE is p = 0.02 and 
from random sets is p < 10~ 10 (z = — IX) based on a normal approximation to panel D. 



supported by both statistical calibrations. Yet random-sets and SAFE eval- 
uate the significance of the same average-rank statistic rather differently. 
Compared to average ranks obtained on random, same-sized sets, the aver- 
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age rank for GO:0019883 is extremely unusual. Compared to the statistic 
we would compute on GO:0019883 if viral expression is not associated with 
host expression, the observed average rank is modestly significant. A similar 
pattern recurs for many categories (Supplementary Figure 5). The two cal- 
ibration approaches agree broadly but differ substantially in their ranking 
of categories, which suggests that the distinct enrichment signal is identified 
by the random-set approach. 

4. Averaging or selection? A theoretical comparison. In the preceding 
case study we observed empirical characteristics of two random-set methods 
for scoring category enrichment. The selection approach begins with a short 
list of extremely altered genes and asks if there is over-representation in the 
category. The averaging approach scores the category simply by averaging 
gene-level evidence across all genes in the category. The associated category 
Z scores exhibit some positive correlation, but evidently they capture dif- 
ferent components of the enrichment signal. Some theoretical findings are 
available which expose properties of the enrichment testing problem. 

Our findings are developed in the context of a generic mixture model, one 
that is structurally similar to models commonly described in the microar- 
ray literature. The model is presented in order to develop a comparison 
of category scoring methods. It is not used for the analysis of data per 
se, but it sheds light on an interesting phenomenon created by this two- 
level (gene/category) inference problem. We find that each category-scoring 
method has its own domain of superiority in the state-space of enrichment 
problems; neither is always preferred. The result is somewhat surprising since 
information is obviously lost in the selection approach and not so obviously 
lost by averaging evidence. The result is related to the debate in statistics 
about model selection versus model averaging. 

Consider genes g & {1,2, . . . ,G} and quantitative gene-level scores {s g }. 
The larger s g , the more evidence for differential expression of gene g. A 
category C is a known subset of m < G genes sharing some particular bio- 
logical function. The category may be scored for enrichment by one of two 
statistics: 



To enable a comparison of the category scores, we frame the problem 
as a test of the null hypothesis that C is not enriched. More specifically, 
suppose that each gene g is either truly differentially expressed (I g = 1) or 
not (I g = 0) between the two cellular states. We allow that some fraction 




averaging 



selection 
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of genes are truly differentially expressed. The category C itself contains a 
fraction 



of differentially expressed genes. No enrichment means Hq : ttq = tt, and this 
is tested against the alternative H\ : ire > vr. We can define enrichment sim- 
ply as ttq — vr. Lack of enrichment does not mean there is no differential 
expression; it just means there is not more than in the whole system. (One 
could also adjust for discreteness of nc, but the adjustment would be negli- 
gible for modestly large category size m, and so it is not pursued.) Statistics 
X sc \ and X avc are two possible test statistics for testing Hq. We compare 
their power against various alternatives. 

The latent differential expression indicator I g affects the distribution of 
the gene- level score s g . A simple location model asserts that s g is normally 
distributed with unit variance and with mean SI g for a gene-level effect 
5 > 0, and that all variables are independent. Normality is often reasonable 
for suitably transformed gene-level scores, such as log 2 (fold) or the trans- 
formed correlation (Section 3). Differential expression could potentially alter 
the variation of scores, but as a first approximation we focus on the location 
shifts only. The possible effects of among-gene dependence are important, 
but they are secondary in the present comparison of enrichment-scoring 
methods, hence, our demonstration is in the independence model (see dis- 
cussion). 

A test based on the average quantitative score X ave uses the sampling 
distribution Normal (#7rc, 1/m). Thus, the power of a level a test is 1 — 
$(r ave ), where <£(•) is the standard normal cumulative distribution and 

(5) r avc = z a - y/m (ir c - 7r) 6 . 



Naturally, the power of the averaging approach increases with effect, enrich- 
ment and category size. One scenario is presented in Figure 4B: a category 
of size m = 20 is tested for enrichment at level a = 0.05 in a system with 
7T = 0.2 of genes differentially expressed. 

The power of the selection approach is similarly derived. It entails a nor- 
mal approximation for X se \ that is well justified in large categories by the 
central limit theorem. The power is 1 — 3>(t sc i), where 





enrichment 



(6) 




y/m(irc - 7r) - <£(fc - S)]/a(irc) ■ 



■a 



a(7r c ) 



enrichment effect 



Here k = k(TT,5,a) is chosen to deliver a FDR-controlled gene list at level 
a (see Appendix). Also, the variance function g 2 (-kc) records the variance 
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of y/mX 8e \. Figure 4A shows how the power to detect enrichment by the 
selection method is affected by ttq — tt and 5 for categories of size m = 20, 
when 7r = 0.2 and a = a = 0.05. 

The power surfaces in Figure 4 reveal an intriguing phenomena in en- 
richment testing. Both selection and averaging increase in power as either 
enrichment ttc — vr or effect 5 increase. However, they increase differently, 
creating domains of superiority for each approach. The lower panels in Fig- 




& s 

Fig. 4. Power comparison. Power to detect enrichment is shown for a category of size 
m = 20 in a system with n = 0.2 differential expression overall (A and B). Selection and 
averaging are being compared, with low power in red and power increasing into the yellows 
and whites. The lower panels show domains of superiority for each method (by imaging 
the threshholded difference on a —1, 1 scale). The enrichment is nc — 7r, and the effect 
of differential expression per gene is S. The maximum power differential is 0.46 (C) and 
0.74 CD). 
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ure 4 show these domains for the case indicated. When the ttq — vr is small, 
but 5 is large, then it is better to use the selection approach. On the other 
hand, if 5 is small, but ttq — vr is large, then it is better to average evidence 
across all genes in the category. The fact that selection can be superior is 
somewhat striking, since it entails a significant amount of information loss; 
each gene score is replaced by a binary indicator of whether or not the score 
is extremely large. On the other hand, if enrichment is weak, then averaging 
evidence combines a lot of noise with signal, thus, diminishing power. 

The nonlinearity of the power functions complicates a general comparison, 
but we have identified sufficient conditions for one or the other approach 
to be superior (see Appendix). To state the result, first put k = ct7r/{(l — 
— 7r)}, where, again, ir is the proportion of differentially expressed genes 
in the whole system and a is the FDR of the gene list used by the selection 
approach. We require < k < 1, else it is not possible to have the desired 
FDR control; this is a weak condition, since it is implied if both < a < 1/2 
and < 7T < 1/2, for example. 

Theorem 1. // 2$~ 1 ( T ^) < 5 < ^= - \fn, then for sufficiently large 
m, selection is more powerful than averaging. Also, given any ttc > vr, there 
exists 5*(ttc,7t) such that if < 5 < d*(-Kc,Tr), then for sufficiently large m, 
averaging is more powerful than selection. 

This is a finding about sub-optimality of two enrichment detection meth- 
ods. In using a selection approach, there is limited power to detect enrich- 
ment when the category under test contains lots of genes that are altered by 
a small amount, reflecting the fact that these genes are not selected as the 
most significantly altered ones. By contrast, selection is superior to averag- 
ing if the category under test is enriched for a small number of highly altered 
genes. We note that the interval for 5 in the first claim is nonempty when 
k is sufficiently small (say, smaller than 0.133). We also note that averaging 
is superior both when 5 is very small and when 5 is very large. Interesting 
power dynamics emerge in the interior of the state space, as, for instance, 
in Figure 4. 

5. Simultaneous inference with multiple categories. An unsolved prob- 
lem with enrichment calculations concerns the comparison of many cate- 
gories that vary in size. On the null hypothesis of no enrichment, each Z 
score is well calibrated by design, with zero mean and unit variance. But 
many categories may be enriched, and unlike simpler genomic testing prob- 
lems, there is different power associated with these different tests. The dis- 
tribution of Z under the enrichment hypothesis is a function of both the 
unknown enrichment ttc — ^ and the known category size m. For instance, 
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in the location-shift model, Z ave has unit variance and mean y/m(irc — 7r)<5. 
Owing to this size effect, the ranking of categories by Z alone may not 
be optimal since large enriched categories will tend to have much larger 
Z scores. The phenomenon is illustrated in the host-virus example in Fig- 
ure 5A. The problem is not limited to Z avc ; it occurs too with Z sc \, especially 
when the selected set is relatively large (data not shown). A partial solution, 
demonstrated in Figure 5B, is to rank categories according to Z/yJrn; then 
categories are ranked by the estimated enrichment ttc — 7r. In itself this does 
not provide an error-controlled list of enriched categories; nor does it account 
for the nonconstant variance of Zj^fm, but the ranking may be calibrated 
and remains useful for prioritizing categories across the GO networks. 

A complete solution to simultaneous multiple-category testing will in- 
volve the joint distribution of category statistics, rather than their marginal 
distributions considered so far. Without developing calculations fully here, 
we note that the joint distribution of Z scores across categories (condi- 
tional on {s g }) is accessible by an analysis of intersections among the differ- 
ent categories. Proper-subset information, for example, is provided by the 
directed graphical structure of GO. Two elements of the random-set ap- 
proach simplify the analysis of multiple categories. First, the permutation 
perspective described in Supplementary Table 1 carries over readily to mul- 
tiple categories. The only difference is that we have an additional row for 
each category. The multiple-category information is equivalent to a (compli- 
cated) cross-classification of genes, and a permutation of the {s g }, with the 
remaining tabulation fixed, is enough to generate the full joint distribution 
of category statistics (Supplementary Table 2). Second, valid and readily 
computed approximations to the joint distribution of category statistics are 
available. By restricting to moderate and large categories, Z scores are ap- 
proximately multivariate normal. The dependence is carried completely by 
between-category correlations, which can be computed by basic random-set 
methods [see Newton (2007)]. We find that if Z\ and Z 2 are standardized 
category scores for two categories of sizes rrt\ and mi which have an overlap 
of 7711,2 genes, then 



For large G7, this is approximately m\^l \Jm\m2. In other words, depen- 
dence is induced by the overlap of categories, and increases the larger is this 
overlap. 

A full analysis of the multiple-category testing problem is beyond the 
scope of this paper. However, we illustrate the utility of the correlation 
formula (7) and the multivariate normal approximation to describe one pos- 
sible approach. Suppose there are k categories under study. On the global 





y/mim 2 (G - mi)(G - m 2 ) 
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Category s\za 




Category $\/ u 

Fig. 5. Power imbalance. Large enriched categories may get a strong Z score by virtue of 
their size m more so than by the level of enrichment ire — t (panel A). The simple correc- 
tion Z I y/m provides a ranking of interesting categories that is adjusted for this size effect 
(panel H). Eleven categories, including GO:0019883 (red dot), have enrichment scores 
exceeding 1.10 (dashed line). These are significant at the 5% level by the maxT multiple 
testing correction computed from a multivariate normal simulation of category enrichment 
scores. Table 3 [Supplementary material] lists characteristics of these 11 interesting cate- 
gories. 

null of no enrichment for any category, the vector (Z\, Z2, ■ ■ ■ , Zk) of enrich- 
ment statistics is approximately Gaussian with mean zero, unit variances 
and covariances in (7). Ideally we would simulate the exact distribution by 
permuting gene scores as in the Supplementary Table 2, but the Gaussian 
provides a computational solution that is more convenient, especially with 
large k. By a Cholesky factorization of the known covariance matrix (even 
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when it is less than full rank) , we can easily simulate the multivariate normal 
vector. Sampled vectors respect the joint distribution of scores, for instance, 
as affected by category overlap, and provide input to various multiple testing 
schemes. For the NPC example, we report results of the maxT procedure 
[Dudoit et al. (2003)] when the categories are ranked by T = Z/y/m to ac- 
commodate the effects of category size on power. For each simulated vector, 
we computed the maximum of the T statistics across categories. Since we did 
not get Z's by permutation, we did not need to recompute category statis- 
tics; rather we simply converted the Z J s to T"s using the category sizes. In 
the NPC example, 11 categories had a T value exceeding the 95th percentile 
of the maxT null distribution. A global view of the results is in Figure 5. 
Supplementary Table 3 summarizes the 11 interesting categories. 

6. Discussion. In analyzing functional categories related to nasopharyn- 
geal cancer tissue, Sengupta et al. (2006) used the random-set enrichment 
method discussed here, with both binary selections from gene-level scores 
and averages of gene-level scores. The Supplementary Material associated 
with that paper presented the method; here, we have amplified that discus- 
sion, derived formulas (1) and (2) for standardization, evaluated the method- 
ology empirically and theoretically, and provided comparative analyses. Evi- 
dence shows that the proposed category-scoring methods capture previously 
hidden components of the enrichment signal. Results are also provided that 
guide simultaneous inference across multiple categories. 

The random-set calibration approach is the same one that underlies Fisher's 
exact test for independence between the selected gene list and the category 
under study. The simplicity of Fisher's test makes it compelling, but the 
test is limited by its focus on selected gene sets. Transferring random-set 
calculations to quantitative gene-level scores is complicated by the fact that 
the Fisher-test hypergeometric distribution no longer applies; an intractable 
distribution takes its place. In deriving the random-set mean and variance of 
a category score, we offer an easily computed approximation and standard- 
ized statistics for measuring category enrichment. This has several practical 
advantages over other schemes that use quantitative scores. By condition- 
ing on results of the differential expression analysis, our calculations can 
handle a wide variety of output from that analysis and we need not re- 
visit the raw data. Methods such as in Barry, Nobel and Wright (2005) and 
Subramanian et al. (2005) calibrate category scores by recomputing the dif- 
ferential expression results over random permutations of raw data. Not only 
can this be limited when the number of microarrays is small, but also the 
null hypothesis at work for such a permutation is the exchangeability of 
microarray labels, which asserts the absence of any differential expression. 
Insofar as enrichment concerns excess differential expression in a category 
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rather than its absence, the random-set approach may be targeting enrich- 
ment more directly. Certainly the calibration is such that different aspects 
of the enrichment signal are being detected by the random-set approach. 
Further comparative analyses are warranted. 

The issue of among-gene dependence is a subtle one that is relevant in 
enrichment calculations. The SAFE/GSEA permutation guards against ill- 
effects of such dependence by shuffling microarray labels and fixing whole 
profiles. Random-set scoring guards against these effects by conditioning on 
results of the differential expression analysis; since {s g } are fixed, whatever 
factors caused them to be dependent cannot enter the calculation. The flip 
side is that our interpretation of random-set Z scores is focused on compar- 
ing the category statistic in hand to its hypothetical value from a random 
set (as opposed to its value on some hypothetical rerun of the whole expres- 
sion study). Indeed, the dependence that in sampling theory terms would 
inflate the variance of the category score and speak against random sets is 
precisely the dependence that we aim to detect as the enrichment signal. 
Random-set calibration gains by its simple interpretation; still, one must 
take care that significant findings are attributable to biologically relevant 
enrichment rather than to something spurious. 

In the context of a location-shift model for differential expression we com- 
pared two random-set methods in terms of their power to detect enrichment 
in a given category C. A clear picture of the sub-optimality of each method 
is available, pointing to the benefits of using several methods together to 
identify different aspects of the enrichment signal. When C contains lots of 
modestly altered genes, then averaging evidence from all genes in the cate- 
gory is more powerful than selection; when C is enriched just slightly, but 
the alteration effect is high, then the use of selected genes is preferred. 

The random-set approach becomes feasible in multi-category inference be- 
cause various properties can be explicitly calculated. Ranking categories by 
their Z score normalized by the square root of category size gives a ranking 
based on estimated enrichment. This partially addresses the problem that 
large categories that are enriched at all will be detected with high probabil- 
ity, contrary, perhaps, to the aim to identify the most enriched categories. 
The joint distribution of Z scores across multiple categories is induced by 
shuffling in a certain contingency table, and it is approximately multivari- 
ate normal under the complete null hypothesis. The dependence between 
scores is mediated by the extent of category overlap [equation (7)]. Multiple 
testing schemes can take advantage of this known dependence to assure the 
identification of significantly enriched categories, though optimal schemes re- 
main to be worked out. We demonstrated the single-step maxT approach in 
Section 5 and produced useful findings for the NPC study. Refinements are 
surely possible, either using stepwise approaches to control family- wise error 
or using one of the methods for false discovery rate control. It is tempting 
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to convert the Z scores to p- values and adopt one of these standard multi- 
ple testing adjustments, but size and dependence issues complicate a simple 
technology transfer. Further work in this direction may help to sort out re- 
porting protocols when, for instance, multiple nodes in one branch of the 
GO graph exhibit varying degrees of enrichment. 

APPENDIX: SELECTION VS AVERAGING (THEOREM 1) 

Which category-testing approach is more powerful depends on the sign 
of A = r se i — r avc , the difference between the Z-score critical values corre- 
sponding to level a tests. Averaging is better if A > 0; selection is better if 
A < 0. Combining (5) and (6), 

(8) A = Z J4^-l\- V^drc ~ tt)!^? " « 



a(7r c ) J I o-(ttc) 

where /io = 1 — <3?(/e) is the mean of the Bernoulli trial l[s g > k] when g is not 
differentially expressed, [i\ = 1 — &(k — 5) is the mean when g is differentially 
expressed, ttq — vr is the enrichment, 6 is the effect of differential expression 
on s g , m is the category size, $(•) is the standard normal cumulative dis- 
tribution function, and <r 2 (-) is a variance function. In this treatment gene 
scores s g are random, and the tests are marginal rather than conditional. 

The threshhold k = k(ir, S, a) can be chosen to target a level a false dis- 
covery rate (FDR) owing to properties of the function 

m- 1_ * (x) 



for each fixed 5 > 0. By examining the derivative of h{x) and invoking Mills' 
ratio [e.g., Gordon (1941)], one can prove the following: 

Lemma A.l. h(x) is monotone decreasing with lim x ^ OQ h(x) = and 
linx^.oo h(x) = 1. 

Thus, h(x) is invertible, and with fixed k = an/{(l — a){\ — tt)} € (0, 1), 
we can certainly find k uniquely satisfying h(k) = k. This k, furthermore, 
must define the threshhold for the level a FDR gene-selection procedure, 
since, by Bayes rule, 

(9) a = P(H \ Sg >k) MO--*) 



Hq(1 -7r) +/X17T 



where fio = 1 — <&(k) and [i\ = 1 — <fr(A; — S). Reorganizing (9), we get the 
defining relation h(k) = Ho/m = k. The 1-1 correspondence between /io (or 
/ii) and 8 is further revealed through 5 = <I>~ 1 (1 — /io) — ( &~ 1 (1 — Ho/n). Both 
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Ho and Hi converge to as 5 — > 0. As 5 — > oo, Ho — > k and /ii — > 1. Always 
= «Mi- Supplementary Figure 6 provides another view of these objects. 
The behavior of the variance a 2 (-Kc) = Mo(l — Mo) + 7r c{Mi(l — Ml) ~~ Mo(l ~~ 
/^o)} depends on the sign of the coefficient of wc- Indeed, there is a critical 
point = + k) < 1/2, when this coefficient equals 0, and a 2 (7Tc) — 
k/(1 + k) 2 for all ttc- For /io > + /c), a 2 (irc) is decreasing in 7Tc; for 
Ho < k/(1 + k), it is increasing in 7rc. Noting the 1-1 correspondence of 5 
and no, we see the critical point occurs at 5 = 2<E> _1 (y^r). 

Consider the first claim on superiority of selection. We seek conditions 
under which A < 0. Figure 4 suggests looking at the case of large 5. With 
5 > 2 < I>~ 1 (y^7t), cr 2 (vrc) is decreasing in ire, an d so the first term in (8) is 
positive but bounded above by 

a(n) 

1 >0. 



a(l) 

Being positive, this term works against having A < 0, but the term is fi- 
nite and not dependent on category size m. Thus, the second term in (8) 
dominates for sufficiently large m, and will drive A < as long as 

(10) ^>*. 

With the effect 5 larger than the critical point mentioned above, we have 

a 2 {7r c ) < (t 2 (tt) < a 2 {0) = no(l ~ Ho) < «/(l + k) 2 
and so (10) holds if 



m\ Mi - Mo . e 

7WTW >S - 



Noting fxi = Mo/ K > rearranging (11) is equivalent to 



and finally, noting > k/(1 + k) in this case, we obtain the upper bound 5 < 
1/^JTi — \J~k. Thus, we have sufficient conditions establishing the superiority 
of selection over averaging in gene-set enrichment. Numerically we find the 
interval in Theorem 1 is nonempty for < k < 0.133. 

We prove the second claim similarly, and thus, in contrast to (10), we seek 
conditions under which 

(12) L= ^\-^ 2 <5 2 . 
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The left-hand side L can be simplified, noting first that (ii = ho/k and also 
that cr 2 (irc) = j«o{l + Kci^ ~ 1)} + 0(/jLq) as /io — > 0. Immediately we obtain 

v r/ (V^-l) 2 

lim L/un = -, TT-i 7T < °°- 

/pu 1 + tt c (1/k-1) 

On the other hand, if the right-hand side of (12) diverges when normalized 
by fiQ, then the stated inequality must hold for an interval of sufficiently 
small effects. Using Mills' ratio, we find the threshhold k ~ (— logK)/S for 
small 5 and, consequently, 



/i = 1 - 



5 f 1 (log AC) 2 

:eX P1 - n" 



(-logK)V27r 12 (5 2 

for small 5. Evaluating further shows that 5 2 /fiQ diverges as — > 0, com- 
pleting the proof. □ 

Computing notes. An R package, allez, was developed to implement 
random-set enrichment calculations, especially for GO categories. The source 
is available at http://www.stat.wisc.edu/~newton/. Calculations reported 
here used R [R Development Core Team (2005)] version 2.1 and Biocon- 
ductor [Gentleman et al. (2004)] package hugl33plus2 version 1.10.0, built 
October 2005. 
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